5.3 Likelihood-Based Inference

1 Likelihood-Based Inference

Setting: $X_1,\dots,X_n \overset{\text{i.i.d.}}{\sim} p_\theta(x)$, where $p_\theta(x)$ is "smooth" in $\theta$. Assume $E_\theta \nabla \ell_1(\theta;X_i) = 0$, that $\mathrm{Var}_\theta[\nabla \ell_1(\theta;X_i)] = -E_\theta \nabla^2 \ell_1(\theta;X_i) = J_1(\theta)$ is positive definite (the Fisher information), and that the MLE is consistent: $\hat\theta_{\mathrm{MLE}} \overset{p_\theta}{\to} \theta$. Then if $\theta = \theta_0$,
$$\frac{1}{\sqrt{n}} \nabla \ell_n(\theta_0;X) \Rightarrow N_d(0, J_1(\theta_0)), \qquad -\frac{1}{n} \nabla^2 \ell_n(\theta_0;X) \overset{p}{\to} J_1(\theta_0).$$
Since $0 = \nabla \ell_n(\hat\theta_n) \approx \nabla \ell_n(\theta_0) + \nabla^2 \ell_n(\theta_0)(\hat\theta_n - \theta_0)$, we get $\sqrt{n}(\hat\theta_n - \theta_0) \Rightarrow N_d(0, J_1(\theta_0)^{-1})$, and we can use this for inference on $\theta_0$!
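As a sanity check, here is a minimal simulation sketch of this limit (the Exponential-rate model, sample sizes, and variable names are illustrative assumptions, not from the notes):

```python
# Check sqrt(n)(theta_hat - theta0) ~ N(0, J1(theta0)^{-1}) by simulation.
# Assumed model: X ~ Exp(rate theta), so the MLE is 1/mean(X) and
# J1(theta) = 1/theta^2, giving limiting variance theta0^2.
import numpy as np

rng = np.random.default_rng(0)
theta0, n, reps = 2.0, 1000, 2000
X = rng.exponential(scale=1 / theta0, size=(reps, n))
theta_hat = 1 / X.mean(axis=1)            # MLE, one per replication
Z = np.sqrt(n) * (theta_hat - theta0)     # should be approx N(0, theta0^2)
print(Z.mean(), Z.var())                  # mean ~ 0, variance ~ 4
```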

1.1 Wald-Type Confidence Regions

Assume we have some estimator $\hat J_n \succeq 0$ such that $\frac{1}{n}\hat J_n \overset{p}{\to} J_1(\theta_0) \succ 0$; then we plug in: if $\sqrt{n}(\hat\theta_n - \theta_0) \Rightarrow N_d(0, J_1(\theta_0)^{-1})$, then $J_1(\theta_0)^{1/2}\sqrt{n}(\hat\theta_n - \theta_0) \Rightarrow N_d(0, I_d)$, so by Slutsky's theorem $\hat J_n^{1/2}(\hat\theta_n - \theta_0) \Rightarrow N_d(0, I_d)$. This leads to a test of $H_0: \theta = \theta_0$: $\|\hat J_n^{1/2}(\hat\theta_n - \theta_0)\|^2 \Rightarrow \chi^2_d$, so $P_{\theta_0}\big(\|\hat J_n^{1/2}(\hat\theta_n - \theta_0)\|^2 > \chi^2_d(\alpha)\big) \to \alpha$.
So we reject $\theta_0$ iff $\|\hat J_n^{1/2}(\hat\theta_n - \theta_0)\|^2 > \chi^2_d(\alpha)$; equivalently, reject $\theta_0$ iff $\theta_0 \notin \hat\theta_n + \hat J_n^{-1/2} B_{\sqrt{\chi^2_d(\alpha)}}(0)$, the confidence ellipsoid. Note $\hat J_n^{-1/2} = O_p(1/\sqrt{n}) \approx n^{-1/2} J_1(\theta_0)^{-1/2}$.
For $d = 1$, we reject $\theta_0$ iff $|\hat J_n^{1/2}(\hat\theta_n - \theta_0)| > z_{\alpha/2}$; equivalently, reject $\theta_0$ iff $\theta_0 \notin \hat\theta_n \pm \hat J_n^{-1/2} z_{\alpha/2}$.

More information $\Rightarrow$ smaller ellipsoid (it shrinks like $1/\sqrt{n}$).
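A minimal sketch of the $d = 1$ interval, assuming a Bernoulli($p$) model (an illustrative choice) with $\hat J_n = n/(\hat p(1-\hat p))$:

```python
# d = 1 Wald interval: theta_hat +/- J_hat^{-1/2} z_{alpha/2},
# for Bernoulli(p) data where J_hat = n / (p_hat (1 - p_hat)).
import numpy as np
from scipy.stats import norm

def wald_ci_bernoulli(x, alpha=0.05):
    n, p_hat = len(x), np.mean(x)
    J_hat = n / (p_hat * (1 - p_hat))     # estimated Fisher information
    half = norm.ppf(1 - alpha / 2) / np.sqrt(J_hat)
    return p_hat - half, p_hat + half

rng = np.random.default_rng(1)
x = rng.binomial(1, 0.3, size=500)
print(wald_ci_bernoulli(x))               # interval near 0.3
```

Note the endpoints are not constrained to $[0,1]$, one of the disadvantages listed below.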

Options for $\hat J_n$:

  1. Most obvious: plug the MLE into $J_n(\theta)$: $\hat J_n = J_n(\hat\theta_n) = \mathrm{Var}_\theta(\nabla \ell_n(\theta;X))\big|_{\theta=\hat\theta_n} \equiv \mathrm{Var}_{\hat\theta_n}(\nabla \ell_n(\hat\theta_n;X))$, or $\hat J_n = -E_\theta \nabla^2 \ell_n(\theta)\big|_{\theta=\hat\theta_n}$.
  2. Observed Fisher information: $\hat J_n = -\nabla^2 \ell_n(\hat\theta_n;X)$.

Both satisfy $\frac{1}{n}\hat J_n \overset{p}{\to} J_1(\theta_0) = \frac{1}{n}J_n(\theta_0)$ (under regularity: continuous second derivative, consistent MLE).
Both make sense outside of the i.i.d. setting, where $\hat J_n / J_n \overset{p}{\to} 1$.
Heuristically, the plug-in estimate measures the information about $\theta$ in a "typical" data set, while the observed information measures the information about $\theta$ in "this" data set.
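A sketch contrasting the two options, for a Cauchy location family where they genuinely differ (the model choice and the closed-form $J_1 = 1/2$ per observation are assumptions for illustration):

```python
# Option 1 (plug-in): J_hat = n * J1(theta_hat), with J1 = 1/2 for the
# Cauchy location model. Option 2 (observed): J_hat = -l_n''(theta_hat).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
x = rng.standard_cauchy(300) + 1.0        # true location theta0 = 1

negloglik = lambda t: np.sum(np.log1p((x - t) ** 2))
theta_hat = minimize_scalar(negloglik, bounds=(-5, 5), method="bounded").x

u = x - theta_hat
plug_in = len(x) * 0.5                    # info in a "typical" data set
observed = np.sum(2 * (1 - u ** 2) / (1 + u ** 2) ** 2)  # info in "this" one
print(theta_hat, plug_in, observed)       # close, but not equal
```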

Wald interval for $\theta_j$: $\hat\theta_n \approx N_d(\theta_0, J_n(\theta_0)^{-1})$, so $\hat\theta_{n,j} \approx N\big(\theta_{0,j}, (J_n(\theta_0)^{-1})_{jj}\big)$ with $(J_n(\theta_0)^{-1})_{jj} = \mathrm{se}(\hat\theta_{n,j})^2$, giving $c_j = \hat\theta_{n,j} \pm \sqrt{(\hat J_n^{-1})_{jj}}\; z_{\alpha/2}$.
Confidence ellipsoid for a subset: $\theta_{0,S} = (\theta_{0,j})_{j \in S}$, $|S| = k \le d$. Then $\hat\theta_{n,S} \approx N_k\big(\theta_{0,S}, (J_n(\theta_0)^{-1})_{SS}\big)$, so $c_S = \hat\theta_{n,S} + \big((\hat J_n^{-1})_{SS}\big)^{1/2} B_{\sqrt{\chi^2_k(\alpha)}}(0)$.
More generally, if $\hat\theta_n$ is any consistent estimator with $\sqrt{n}(\hat\theta_n - \theta_0) \Rightarrow N_d(0, \Sigma(\theta_0))$, and we have $\hat\Sigma_n$ with $n\hat\Sigma_n \overset{p_{\theta_0}}{\to} \Sigma(\theta_0) \succ 0$, then $\hat\Sigma_n^{-1/2}(\hat\theta_n - \theta_0) \Rightarrow N_d(0, I_d)$. ($\hat\theta_n$ need not be the MLE.)
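A generic sketch of the coordinate-wise intervals above, taking $\hat\theta_n$ and $\hat J_n$ as given (all names and numbers here are hypothetical):

```python
# c_j = theta_hat_j +/- sqrt((J_hat^{-1})_{jj}) * z_{alpha/2}, for each j.
import numpy as np
from scipy.stats import norm

def wald_intervals(theta_hat, J_hat, alpha=0.05):
    se = np.sqrt(np.diag(np.linalg.inv(J_hat)))   # sqrt((J_hat^{-1})_jj)
    z = norm.ppf(1 - alpha / 2)
    return np.column_stack([theta_hat - z * se, theta_hat + z * se])

# hypothetical 2-parameter fit:
print(wald_intervals(np.array([1.0, -0.5]),
                     np.array([[400.0, 50.0], [50.0, 100.0]])))
```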

By the same Taylor expansion of $\ell_n$ (e.g. for regression coefficients $\beta$): $\hat J_n^{1/2}(\hat\beta_n - \beta_0) \Rightarrow N_d(0, I_d)$, where $\hat J_n = -\nabla^2 \ell_n(\hat\beta_n;X)$.


Advantages and Disadvantages
Advantages:

  1. Easy to invert, simple confidence regions.
  2. Asymptotically correct.

Disadvantages:

  1. Have to compute the MLE.
  2. Depends on parameterization.
  3. Relies on two approximations: $\nabla\ell_n \approx$ normal, $\ell_n \approx$ quadratic.
  4. Needs the MLE to be consistent.
  5. Confidence interval/ellipsoid can extend outside $\Theta$.

1.2 Score Test

Test $H_0: \theta = \theta_0$ vs $H_1: \theta \ne \theta_0$.
We can bypass the quadratic approximation entirely by using the score as the test statistic:

$$\frac{1}{\sqrt{n}} \nabla \ell_n(\theta_0;X) \overset{P_{\theta_0}}{\Rightarrow} N_d(0, J_1(\theta_0)), \quad \text{or} \quad J_n(\theta_0)^{-1/2} \nabla \ell_n(\theta_0;X) \overset{P_{\theta_0}}{\Rightarrow} N_d(0, I_d).$$

We reject $H_0: \theta = \theta_0$ if $\|J_n(\theta_0)^{-1/2} \nabla \ell_n(\theta_0;X)\|^2 \ge \chi^2_d(\alpha)$; for $d = 1$, $\dot\ell_n(\theta_0)/\sqrt{J_n(\theta_0)} \Rightarrow N(0,1)$, so we can also do one-sided tests.
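A minimal sketch of the $d = 1$ score test in a Poisson model (an illustrative choice); note that, unlike the Wald test, no MLE is needed since everything is evaluated at the null value:

```python
# Score test of H0: lambda = lambda0 for X_i ~ Poisson(lambda):
# l_n'(lambda0) = sum(x)/lambda0 - n, J_n(lambda0) = n/lambda0.
import numpy as np
from scipy.stats import norm

def poisson_score_test(x, lam0):
    n = len(x)
    score = x.sum() / lam0 - n            # score at the null value
    J_n = n / lam0                        # Fisher information at lambda0
    z = score / np.sqrt(J_n)              # approx N(0,1) under H0
    return z, 2 * norm.sf(abs(z))         # two-sided p-value

rng = np.random.default_rng(3)
x = rng.poisson(2.0, size=200)
print(poisson_score_test(x, lam0=2.0))    # H0 true: z small, p large
```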

This can be generalized to the case with nuisance parameters; typically they are estimated via the MLE over $\Theta_0$.

The score test is invariant to reparameterization: assume $d = 1$, $\theta = g(\xi)$ with $g'(\xi) > 0$ for all $\xi$, and let $q_\xi(x) = p_{g(\xi)}(x)$. Then $\dot\ell^{(\xi)}(\xi;x) = \frac{d}{d\xi} \log p_{g(\xi)}(x) = \dot\ell^{(\theta)}(g(\xi);x)\, g'(\xi)$ and $J^{(\xi)}(\xi) = J^{(\theta)}(g(\xi))\, g'(\xi)^2$, so $\frac{\dot\ell^{(\xi)}(\xi_0;X)}{\sqrt{J^{(\xi)}(\xi_0)}} \overset{\text{a.s.}}{=} \frac{\dot\ell^{(\theta)}(\theta_0;X)}{\sqrt{J^{(\theta)}(\theta_0)}}$ if $\theta_0 = g(\xi_0)$.

Example (multinomial): let $N = (N_1,\dots,N_d) \sim \mathrm{Multinomial}(n, \pi)$. Note $\sum_{i=1}^d \pi_i = 1$, so this is a full-rank $(d-1)$-parameter exponential family, e.g. with natural parameter $\eta$ and
$$\pi_j = \begin{cases} \dfrac{1}{1 + \sum_{k>1} e^{\eta_k}}, & j = 1, \\[4pt] \dfrac{e^{\eta_j}}{1 + \sum_{k>1} e^{\eta_k}}, & j > 1. \end{cases}$$
So $\nabla \ell_n = (N_2,\dots,N_d) - (n\pi_2,\dots,n\pi_d)$,
$$\mathrm{Var}_\eta(\nabla \ell_n(\eta)) = \begin{pmatrix} n\pi_2(1-\pi_2) & \cdots & -n\pi_i\pi_j \\ \vdots & \ddots & \vdots \\ -n\pi_i\pi_j & \cdots & n\pi_d(1-\pi_d) \end{pmatrix} = n\big(\mathrm{diag}(\pi_{2:d}) - \pi_{2:d}\pi_{2:d}^T\big),$$
$$J_n(\eta)^{-1} = \frac{1}{n}\Big(\mathrm{diag}(\pi_{2:d})^{-1} + \frac{1}{\pi_1}\mathbf{1}\mathbf{1}^T\Big).$$
Here we use the Sherman-Morrison formula $(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1}uv^TA^{-1}}{1 + v^TA^{-1}u}$. So the score test statistic for $H_0: \pi = \pi_0$ is
$$\nabla \ell_n(\eta_0)^T J_n^{-1}(\eta_0)\, \nabla \ell_n(\eta_0) = \sum_{j=1}^d \frac{(N_j - n\pi_{0j})^2}{n\pi_{0j}} \overset{P_{\pi_0}}{\Rightarrow} \chi^2_{d-1},$$
which is Pearson's chi-square statistic.
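A quick numeric sketch confirming the identity, cross-checked against scipy's Pearson chi-square implementation (the counts here are simulated for illustration):

```python
# The multinomial score statistic equals Pearson's chi-square.
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(4)
n, pi0 = 1000, np.array([0.2, 0.3, 0.5])
N = rng.multinomial(n, pi0)               # observed counts N_1..N_d
expected = n * pi0                        # n * pi_{0j}

stat = np.sum((N - expected) ** 2 / expected)
print(stat, chisquare(N, expected))       # same statistic, df = d - 1 = 2
```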

2 Generalized LRT

Test $H_0: \theta = \theta_0$ vs $H_1: \theta \ne \theta_0$. Taylor expand around $\hat\theta_n$:

$$\ell_n(\theta_0) - \ell_n(\hat\theta_n) = \nabla\ell_n(\hat\theta_n)^T(\theta_0 - \hat\theta_n) + \tfrac{1}{2}(\theta_0 - \hat\theta_n)^T \nabla^2 \ell_n(\tilde\theta_n)(\theta_0 - \hat\theta_n) = -\tfrac{1}{2}\Big\|\Big(-\tfrac{1}{n}\nabla^2\ell_n(\tilde\theta_n)\Big)^{1/2} \sqrt{n}\,(\theta_0 - \hat\theta_n)\Big\|^2 \Rightarrow -\tfrac{1}{2}\chi^2_d,$$
using $\nabla\ell_n(\hat\theta_n) = 0$, where $\tilde\theta_n$ lies between $\theta_0$ and $\hat\theta_n$.

Test statistic: $2\big(\ell_n(\hat\theta_n;X) - \ell_n(\theta_0;X)\big) \overset{P_{\theta_0}}{\Rightarrow} \chi^2_d$.
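A minimal sketch of this statistic for a Poisson null $H_0: \lambda = \lambda_0$ (an assumed example; the factorial terms cancel in the log-likelihood difference):

```python
# Generalized LRT statistic 2(l_n(lambda_hat) - l_n(lambda0)), compared
# to chi-square(1) since d = 1 here.
import numpy as np
from scipy.stats import chi2

def poisson_lrt(x, lam0):
    lam_hat = x.mean()                    # unrestricted MLE
    stat = 2 * (x.sum() * np.log(lam_hat / lam0) - len(x) * (lam_hat - lam0))
    return stat, chi2.sf(stat, df=1)      # statistic and p-value

rng = np.random.default_rng(5)
x = rng.poisson(2.0, size=200)
print(poisson_lrt(x, lam0=2.0))           # H0 true: stat small, p large
```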


Consider $H_0: \theta \in \Theta_0$ vs $H_1: \theta \in \Theta \setminus \Theta_0$; assume $\Theta_0$ is (locally, near $\theta_0 \in \Theta_0$) a smooth $d_0$-dimensional submanifold of $\Theta \subseteq \mathbb{R}^d$, with the same regularity conditions as above.

Then $2\big(\ell_n(\hat\theta_n) - \ell_n(\hat\theta_0)\big) \Rightarrow \chi^2_{d - d_0}$, where $\hat\theta_0 = \arg\max_{\theta \in \Theta_0} \ell_n(\theta;X)$ is the MLE under $H_0$.

Why? Assume WLOG $\theta_0 = 0$ and $J_1(0) = I_d$. Then $\hat\theta_n \approx N_d(\theta_0, \tfrac{1}{n} I_d)$, and locally $-\nabla^2 \ell_n(\theta) \approx n I_d$ near $\theta_0$, so $\ell_n(\theta) \approx \ell_n(\hat\theta_n) - \tfrac{n}{2}\|\theta - \hat\theta_n\|^2$. Hence $\hat\theta_0 \approx \arg\min_{\theta \in \Theta_0} \|\theta - \hat\theta_n\| = \mathrm{Proj}_{\Theta_0}(\hat\theta_n)$ and
$$2\big(\ell_n(\hat\theta_n) - \ell_n(\hat\theta_0)\big) \approx n\,\|\hat\theta_n - \mathrm{Proj}_{\Theta_0}(\hat\theta_n)\|^2 = n\,\|\mathrm{Proj}_{\Theta_0^\perp}(\hat\theta_n)\|^2 \Rightarrow \chi^2_{d - d_0}.$$

3 Asymptotic Equivalence

Recall the quadratic approximation picture ($d = 1$):

$$\ell_n(\theta) \approx \ell_n(\theta_0) + \dot\ell_n(\theta_0)(\theta - \theta_0) - \tfrac{1}{2} J_n(\theta_0)(\theta - \theta_0)^2.$$
For large $n$,
$$2\big(\ell_n(\hat\theta_n) - \ell_n(\theta_0)\big) \approx \big\|J_n(\theta_0)^{1/2}(\hat\theta_n - \theta_0)\big\|^2.$$

Then Wald: $\|\hat J_n^{1/2}(\hat\theta_n - \theta_0)\|^2$; Score: $\|J_n(\theta_0)^{-1/2} \nabla \ell_n(\theta_0)\|^2$; LRT: $2(\ell_n(\hat\theta_n) - \ell_n(\theta_0))$. Under $H_0$, all three statistics are asymptotically equivalent.
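A numeric sketch of this equivalence for a Bernoulli model (an assumed example): under $H_0$ and large $n$, the three statistics nearly coincide:

```python
# Wald, score, and LRT statistics for H0: p = p0, all approx chi2(1).
import numpy as np

rng = np.random.default_rng(6)
p0, n = 0.3, 2000
x = rng.binomial(1, p0, size=n)
p_hat, s = x.mean(), x.sum()

wald  = n * (p_hat - p0) ** 2 / (p_hat * (1 - p_hat))   # info at p_hat
score = n * (p_hat - p0) ** 2 / (p0 * (1 - p0))         # info at p0
lrt   = 2 * (s * np.log(p_hat / p0)
             + (n - s) * np.log((1 - p_hat) / (1 - p0)))
print(wald, score, lrt)                   # all close to each other
```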

4 Asymptotic Relative Efficiency (ARE)

Suppose $\hat\theta_n^{(i)}$, $i = 1, 2$, are two asymptotically normal estimators of $\theta_0 \in \mathbb{R}$, with $\sqrt{n}(\hat\theta_n^{(i)} - \theta_0) \Rightarrow N(0, \sigma_i^2)$. The ARE of $\hat\theta^{(2)}$ w.r.t. $\hat\theta^{(1)}$ is $\sigma_1^2 / \sigma_2^2$. E.g., if $\sigma_2^2 = 2\sigma_1^2$ then $\hat\theta^{(2)}$ is 50% as efficient.
Interpretation: suppose $\sigma_1^2 / \sigma_2^2 = \gamma \in (0,1)$. Then for large $n$, $\hat\theta^{(1)}_{[\gamma n]}(X_1,\dots,X_{[\gamma n]}) \overset{D}{\approx} \hat\theta^{(2)}_n(X_1,\dots,X_n) \approx N(\theta_0, \sigma_2^2/n)$: using $\hat\theta^{(2)}$ is like throwing away $100(1-\gamma)\%$ of the data and then using $\hat\theta^{(1)}$.
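A simulation sketch of this for the classic mean-vs-median comparison under normal data (an assumed example: $\sigma_2^2 = \pi/2$ for the median, so the ARE is $2/\pi \approx 0.64$):

```python
# ARE of the sample median w.r.t. the sample mean for N(0, 1) data.
import numpy as np

rng = np.random.default_rng(7)
n, reps = 500, 5000
X = rng.normal(0.0, 1.0, size=(reps, n))
var_mean = n * X.mean(axis=1).var()        # -> sigma1^2 = 1
var_med  = n * np.median(X, axis=1).var()  # -> sigma2^2 = pi/2 ~ 1.57
print(var_mean / var_med)                  # ARE ~ 2/pi ~ 0.64
```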